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1 Notation 



Notation 




Description 






General 


1 A 




the indicator function for set A. 


B n 




Bell numbers. 






Evolutionary model 


71 




the number of individuals, and therefore the maximum possible 






number of species. 


k 




the number of species. 


t 




the origin time. 


f() 




the node density. 


m() 




the 'mixin' density. 


a() 




the prior density for t. 


A 




the speciation rate. 


u 




the extinction rate. 


a,B 




a = A — fj, and ft — /i/ A. 






User-chosen speciation parameters 


W 




the 'collapsing weight'. 






the 'collapsing height'. 


T 




the speciation threshold. 






Simulations, results 


G 




the number of loci. 


T 




the mutation rate in substitutions per site per generation. 


0 




the table of clusterings Z\, Z2, ■ ■ ■ Z z with corresponding posterior 






r>roba bili ties r>i i)o Tt~ 


z* 




the true clustering. 


M 




an estimated similarity matrix. 


M* 




the true similarity matrix. 






Rand index definition 


S = {oi, . . . , 


On} 


a set of n objects. 


X = {X U ... 


> x x }, 


a clustering of S into x subsets (clusters) . 


¥ = {¥,,.... 




a clustering of S into y subsets (clusters). 


A(Z,i,j) 




defined as 1 if i and j are in the same cluster in the clustering Z 






and 0 otherwise. 


f 2 (X,Y,j,k) 




defined as 1 1 if fi(X, is equal to fi(Y, i,j), else 0. 



2 Figures and tables 

These include some figures in the main text as well as extra ones, for convenient comparison. We 
define an alternative point estimator to be the clustering Z in Q which minimises D mat (Z, M(S1)). 
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2.1 Varying epsilon 

Note that the scales of the x-axes differ between Figs 2 and 3. 
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Figure 1: The boxplots show the values of the error metric D(Q, Z*) over ten replicates as G, T, 
and e vary. In the labels, G3T4epsl means the number of genes G is 3, the mutation rate T is 4e-8, 
and e is le-5 = 0.00001. The other values of e are 0.00003 and 0.0001. A beta prior with shape 
parameters 8 and 2 was used for w which is estimated. 
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Figure 2: Rand metric between point estimate Z and true clustering, as G, T, and e vary. 
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Figure 3: Rand metric between point estimate Z and true clustering, as G, T, and e vary. 
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Figure 4: Posterior probability of true clustering, as G, T, and e vary. 
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Table 1: Number of times out of 10 that the true clustering failed to be in the 0.95 credible set from a 
MCMC run over 20 mlliion generations with the first 10 million discarded as burnin., as G, T, and e vary. 
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2.2 Varying w 
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Figure 5: D(Q, Z*) as G, T, and w vary. The numbers after the 'k' in the label are the prior mean 
values for the number of species which are affected by the value of w The value of e is 0.0001. 
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Figure 6: Rand metric between point estimate Z and true clustering, as G, T, and w vary. 
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Figure 7: Rand metric between point estimate Z and true clustering, as G, T, and w vary. 
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Figure 8: Posterior probability of true clustering, as G, T, and w vary. 
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Table 2: Number of times out of 10 that the true clustering failed to be in the 0.95 credible set, as G, T, 
and w vary. 
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2.3 Single species 



Note that all the Rand metrics between Z and true clustering, and Z and true clustering, were 
zero, and the true clustering was always in the 0.95 credible set. 
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Figure 9: The true clustering is a single cluster. The plot shows D(Q, Z*) for varying number of 
individuals n and genes G. The value of e is 3e-5. A beta prior with shape parameters n-1 and 1 
was used for w which is estimated. 
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Figure 10: Posterior probability of true clustering, as n and G vary. 
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2.4 Thomomys data set 
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Figure 11: Similarity matrices for Thomomys data set under various e (e), r (t) and collapse weight 
(w) values. 
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3 Cluster analysis for systematists 



Species delimitation is a type of cluster analysis. Cluster analysis is used in many areas of science, 
including machine learning and data mining, as well as various areas within bioinformatics, but it 
may be unfamiliar to some systematists. Different areas of science use different terminology, and 
one potential confusion seems worth pointing out. In most areas of science a sharp distinction is 
made between classification and cluster analysis. 'Classification' is used to mean only the 
assignment of new specimens to one of a fixed set of known classes. The assignment of a 
handwritten symbol to one letter in an alphabet is a typical example of classification. Cluster 
analysis means the discovery or invention of clusters among specimens. These clusters might later 
be regarded as classes into which new specimens can be classified, but this is no longer cluster 
analysis. 

There is a huge variety of clustering algorithms, and also many ways of evaluating clusterings, and 
measuring the differences between them. In this paper, we use one of the simplest, the Rand 
index [6]: 

Given a set of n elements S — {o\, . . . , o„} and two partitions of S to compare, 

X = {X\, . . . , X r }, a partition of S into r subsets, and Y — {Yi, . . . , Y s }, a partition of 

S into s subsets, define the following: 

• a, the number of pairs of elements in S that are in the same set in X and in the 
same set in Y 

• b, the number of pairs of elements in S that are in different sets in X and in 
different sets in Y 

• c, the number of pairs of elements in S that are in the same set in X and in 
different sets in Y 

• d, the number of pairs of elements in S that are in different sets in X and in the 
same set in Y 

The Rand index, R, is: 

a + b _ a + b 
~ a + b + c + d ~ ~(f)~ 

This comes from Wikipedia where more information can be found. 



3.1 Online resources 



http : //en . wikipedia . org/wiki/Cluster_analysis 
http : //en . wikipedia . org/wiki/Rand_measure 

A book by Jain and Dubes [5]: 

homepages . inf . ed . ac . uk/rbf /BOOKS/ JAIN/Clustering_Jain_Dubes . pdf 



4 Calculations 

4.1 Node height density 

Conditioned on k, t, A, and /i the density of an unordered node height s is given by Theorem 2.5 of 

[3] as 

/(-IMAM) = ; A _^- ( ,- M)a)2 !_%-»)« V*] 



9 



Setting a = X — /j, and f3 = fi/X so that \i = (3X and A = a/(l — this can be rewritten as 

/(-IM,A,M) = (1 _^l as)2 l[o,t] (2) 

where s is a node height. Note that k does not appear on the right hand side. 

4.2 Origin height density 

Usually an improper uniform prior on [0, oo) is assumed for the origin time t of the tree, which is 
then conditioned on the number of species k. This conditional density for t is shown in Theorem 
3.2 of [3] to be 

(1 - e -{\-ij)t\k-i e -(\-n)t 

•«*> - { x-,J-^ 

= Ml-/3)e- at ( \_^_J )fc+1 (3) 

Using the probabilities from expression (3) in the main paper (the one starting (\C (k)\) . . .) , the 
prior density for t is 

?(*) = E ( 1 1 J) (! - «') fc " 1 «' n - fc «(*l*) (4) 



fc=i 

This can be expressed as 

n-l 



«(*) = E • (i-^^-'-Mtli + i) 

n V 3 / 



0=0 



'n-lY.. n -i-iA 1 -«')( 1 - e - Qt ) V 



E f ') (i - + i)«(i - ^)«-" (1 % e - a t)L 

«(l-/3) e - at (l-/3 e - Q *)- 2 ^(. ? + l) 

«E(i + 1 )( n "% n " 1 "^' 

j=o \ 3 J 



w 



3 J V We 



where 
Then 



a = a(l-/3)e- at (l-^e- at )- 2 and b = (1 - w)(l - e- at )(l ~ /3e" a *)- ■ (5) 



«(*) = ME^ ■ V-^ + Ey" • V" 1 ^ 
\j=o \ 3 J J=0 \ 3 J 

= a^(w + + g(i + 1) ^ " t^" 8 "'^ 



= a ^( W + 6)"- 1 + 6^(n-l)^ n . ^m"" 2 "^ 

= a ((w + ft)™" 1 + (n - l)&(u> + &)™~ 2 ) 
= a(w + fe)"- 2 ((w + 6) + (n-l)fe) 

= a(w + b) n - 2 (w + nb). (6) 

Equations (1) and (6) provide the basis for a new class BirthDeathCollapseModel extending 
SpeciationModel. It contains a parameter for the origin height t as well as for a and (3 as in the 
usual birth-death model. 
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4.3 Distance between similarity matrices 



We start with an explicit formula for the Rand index. Suppose there is a set of n objects 
S = {oi, . . . , o„} and two clusterings of S to compare, X = {X\, . . . , X x }, a clustering of S into x 
clusters; and Y — {Yi, . . . , Y y }, a clustering of S into y clusters. For any clustering Z of S, and 
1 < i < j < n, define fi(Z, to be 1 if i and j are in the same cluster in the clustering Z and 0 
otherwise. Next define f2(X,Y,i,j) to be 1 if fi(X,i,j) is equal to fi(Y,i,j), and 0 otherwise. In 
words, f2(X, Y, is 1 if the two clusterings X and Y 'agree' on the pair (i, j) and 0 if they 
'disagree'. Then the Rand index is: 



R(X,Y) 



-1 n 



]T f 2 (X,Y,i,j) 



i,j:i>j 



Recall that M — M(Ct) is the similarity matrix and M* is the true similarity matrix. 

-1 n 



D mat (M,M*) 



m— 1 



E i M > 

i,j:i>j 
-1 n 

E 

l,j:i>j 
-1 n 

E 

l,j:i>j 
-1 n 

E 

i,j:t>j 
-In z 

E E^( 1 -^( z ™' z *' j ^')) 

2,,7 m— 1 
-1 n 

g (l-f 2 (Z m ,Z*,i,j)) 

i,j:i>j 



yY2p m fi(z m ,i,j)J -fi(Z*,i,j) 

z z 

E Pmfl(Z m ,i,j) - E Pmfl(Z*,i,j) 
m—1 m.— l 

z 

E Pm(h(Z m ,i,j) - fi(Z*i,j)) 



m—1 



= E p™ 1 



m—1 



^ f 2 (Z m ,Z*,i,j) 



i,j:i>j 



= J2p m (l-R{Z m ,Z*)) 



m—1 



— PmR{Zm, 
m—1 

= D(n). 



(7) 



5 Usage 

The model is implemented in the development version vl.8.0pre of BEAST ([1], [2]), more 
specifically the multispecies coalescent model *BEAST [4]. 
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5.1 Running a Dissect analysis 



BEAUTi can be used to set up most of the analysis. Each species should only be assigned 
individuals which definitely belong together, since the program will consider merging but never 
splitting these minimal clusters. The word 'species', as it appears in BEAUTi and in the BEAST 
XML, should be interpreted as a minimal cluster. 

Two changes need to be made to the XML. The birthDeathModel must be replaced with a 
birthDeathCollapseModel, and an operator must be added for the origin height. 

The value of w can be set using the element collapseWeight. It can cither be fixed or estimated. 
Even when w is fixed, the prior on the number of species k is somewhat diffuse. If you want to 
estimate w, you can add a hyperprior and an operator for it. The model does not permit k to have 
a more concentrated prior than the binomial distribution decribed above, but you can obtain a 
more diffuse prior on k with a suitable hyperprior for w. 

The value of e is set in the birthDeathModel using the attribute collapseHeight. 

Optionally tmrcaStatistics may be added for groups of interest. A statistic 
bdcNClustersStatistic for the number of clusters can also be added. A trace of the latter can 
indicate if the MCMC chain is having trouble changing the number of almost collapsed nodes. 

The Yule model for speciation can be used with Dissect, by setting the initial values of 

species. birthDeathCollapse . relativeDeathRate to 0, and removing the operator wich changes 

it. 

XML for the birth-death-collapse model 

•CbirthDeathCollapseModel id= "birthDeathCollapse" unit s=" substitutions" 
collapseHeight="0.0001"> 
<speciesTree> 

<speciesTree idref ="sptree"/> 
</ speciesTree> 
<birthMinusDeathRate> 

<parameter id= "species .birthDeathCollapse .meanGrowthRate" 
value="100" lower="0.0" upper="Inf inity"/> 
</birthMinusDeathRate> 
<relativeDeathRate> 

<parameter id=" species .birthDeathCollapse . relativeDeathRate" 

value="0.5" lower="0.0" upper=" 1 . 0" /> 

</relativeDeathRate> 
<originHeight> 

<parameter id=" species .birthDeathCollapse . originHeight" 
value="0.2" lower="0.0" upper="Inf inity"/> 

</ originHeight> 
<collapseWeight> 

<parameter id=" species .birthDeathCollapse . collapseWeight" 
value="0.5" lower="0.0" upper=" 1 . 0"/> 

</ collapseWeight> 
</birthDeathCollapseModel> 

XML for the operator for origin height 

<scaleOperator scaleFactor="0 . 75" weight="3"> 

<parameter idref =" species. birthDeathCollapse . originHeight "/> 
</scaleOperator> 

XML for the operator for w (if w is estimated) 

<scaleOperator scaleFactor="0 . 75" weight="3"> 

<parameter idref =" species .birthDeathCollapse . collapseWeight"/> 
</scaleOperator> 
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XML for a number-of-clusters statistic 

•CbdcNClustersStatistic id="nClusters" name="nClusters"> 
<collapseModel> 

<birthDeathCollapseModel idref ="birthDeathCollapse"/> 
</collapseModel> 
<speciesTree> 

<speciesTree idref ="sptree"/> 
</speciesTree> 
</bdcNClustersStatistic> 

XML for a tmrca statistic 

•CtmrcaStatistic id="speciesTree . talpoidesHeight " name="speciesTree . talpoidesHeight"> 
<speciesTree idref ="sptree"/> 
<mrca> 
<taxa> 

<sp idref ="Thomomys_talpoidesl"/> 
<sp idref ="Thomomys_talpoides2"/> 
<sp idref ="Thomomys_talpoides3"/> 
</taxa> 
</mrca> 
</tmrcaStatistic> 

XML for a prior and an operator for w (if w is estimated) 

This goes in the prior element, in the mcmc element: 
<betaPrior shape="5" shapeB="2"> 

<parameter idref =" species. birthDeathCollapse . collapseWeight"/> 
</betaPrior> 

This goes in the operators element: 
<scaleOperator scaleFactor="0 . 75" weight="3" > 

<parameter idref =" species .birthDeathCollapse . collapseWeight" /> 
</scaleDperator> 

5.2 Using SpeciesDelimitationAnalyser 

You can run this tool using a command line like that for BEAST, with dr.app. beast .BeastMain 
replaced by dr.app. tools. SpeciesDelimitationAnalyser. 

The last two arguments are always an input file name and an output filename. The input file is the 
MCMC samples of species trees. The output is a table which lists the clusterings which are found 
when the nodes with small height have been collapsed. There are three optional arguments, shown 
in this example: 

SpeciesDelimitationAnalyser -burnin 10000 -collapseheight .001 
-simcutoff .95 treesamples.txt out.txt 

• burnin is the number of samples to be considered as 'burn-in' which will be ignored. 

• collapseheight is the height below which nodes get collapsed. This would normally be equal 
to or larger then the value of e used in the BEAST analysis. 

• simcutoff is the value above which two clusterings are regarded as similar enough to support 
one another's credibility. It may be useful to change this when summarising the results as a 
single clustering. The similarity between two clusterings is the Rand index, and is a value 
between 0 and 1. The idea is that a clustering may not be very common among the samples, 
but may be similar to a lot of other clusterings, and the value of simcutoff affects which 
clustering is regarded as the 'best' summary of the results. The summary is somewhat 
analogous to a maximum clade credibility tree. Smaller values of simcutoff mean taking into 
account more dissimilar clusterings. This option should be regarded as experimental: set it 
to 1.0 if you want to ignore it. 
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Here is an imaginary example of the output to illustrate the format. 



count 


fraction 


similarity 


nclusters 


a 


b 


c 


d 


6000 


0.6 


6000.0 


1 


0 


0 


0 


0 


3000 


0.3 


3000.0 


2 


0 


1 


1 


1 


1000 


0.1 


1000.0 


3 


2 


0 


1 


2 



The count column is the number of samples containing the clustering. The fraction is this count 
divided by the number of samples. The similarity column is a measure of how similar the 
clustering is to others. 

The remaining columns describe a clustering. The nclusters column is the number of clusters. The 
other columns are labelled with the names of the 'species' (minimal clusters) used in the BEAST 
XML for the analysis. The numbers in the columns show how these are grouped; the numbers 
themseves are arbitrary, but if two columns have the same value, the minimal clusters are grouped 
together in the clustering. So the table above shows clusterings abed, a+bed and ad+b+c. 

The Rand indices between these clusterings are R(abcd, a + bed) = 3/6, R(abcd, ad + b + c) = 2/6, 
R(a + bed, ad + b + c) = 2/6. If simcutoff is above 1/2, the similarity column will be unaffected. If 
simcutoff is between 1/3 and 1/2, the first two clusterings will support one another. 

The set of clusterings can be used to produce into a similarity matrix, which contains the posterior 
probabilities of each pair of minimal clusters belong to the same cluster. See table 3 for the 
similarity matrix corresponding to the imaginary example, and the main paper for a larger 
example. 





a 


b 


c 


d 


a 


1 


.6 


.6 


.7 


b 


.6 


1 


.9 


.9 


c 


.6 


.9 


1 


.9 


d 


.7 


.9 


1 


.9 



Table 3: Similarity matrix 



6 Obtaining a development version of BEAST 

You will need: 

• Java and a Java development kit (JDK) installed. 

• Subversion installed, http://subversion.apache.org/ 

• Ant installed, http://ant.apache.org/ 

6.1 Linux and Mac OS X instructions 
6.1.1 Check out the code 

Type 

svn checkout http://beast-mcmc.googlecode.com/svn/trunk/ beastl6 

beastl6 names a directory; you can use another name. You should see a long listing of files ending 
something like 
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A beastl6/ examples/incorrect/testOTFPCLikelihood.xml 

U beastl6 
Checked out revision 4366. 



6.1.2 Build BEAST 



cd to beastl6 and type ant. You should sec something like this: 

[gjones@pcl58250 beastl6]$ ant 
Buildfile: build. xml 



clean: 



init : 

[echo] BEAST: /home/gj ones/beast 16/build. xml 



compile-all : 

[mkdir] Created dir: /home/gjones/beastl6/build 

[javac] Compiling 1836 source files to /home/gj ones/beast 16/build 

[javac] Note: Some input files use or override a deprecated API. 

[javac] Note: Recompile with -Xlint : deprecation for details, 

[javac] Note: Some input files use unchecked or unsafe operations, 

[javac] Note: Recompile with -Xlint : unchecked for details, 

[echo] Successfully complied. 



dist-all : 

[mkdir] Created dir: /home/gjones/beastl6/build/dist 

[jar] Building jar: /home/gjones/beastl6/build/dist/beast . jar 
[jar] Building jar: /home/gjones/beastl6/build/dist/beauti . jar 

build: 

BUILD SUCCESSFUL 
Total time: 37 seconds 
[gjones@pcl58250 beastl6]$ 

6.1.3 Copy some 'properties' files from source directory to build directory 

Type 

cp beast 16/src/dr/ app/beast/* .properties beast 16/build/dr/app/beast/ 



6.2 Windows instructions 

You have to do the same steps: download the code, compile it, and run it via a java command. I 
use TortoiseSVN which is a GUI for Subversion and Eclipse which is an IDE for Java. This is 
overkill if you just want to run a development version of BEAST, but I have not tried any other 
way. 



6.3 Test BEAST is working 

In order to run BEAST you need to execute a java command and supply a 'classpath' which is in 
this case a list of three places (. meaning 'here', /home/gj ones/beast 16/build/, and 
/home/gj ones/beast 16/lib/*) to tell java where to look for code. Then there is the java class 
BeastMain to run and the XML file YuleSingleLocus . xml which is input to BEAST. Under 
Linux, all as one line: 
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java -classpath ' . : /home/gj ones/beast 16/build/ :/home/gj ones/beast 16/lib/*' 

dr . app . beast . BeastMain beast 16/examples/release/ starBEAST/YuleSingleLocus . xml 

Under Windows, the classpath is separated by ; not : and the code is a different place (bin not 
build) and file paths use \ not / and the single quotes don't seem to be needed. It might look like 
this: 

java -classpath . ; C : \workspace\beastl6\bin; C : \workspace\beastl6\lib/* 

dr . app . beast . BeastMain beast 16\examples\release\starBEAST\YuleSingleLocus . xml 

You should see a normal BEAST run. If you get an error like this 

Exception in thread "main" java.lang.NoClassDefFoundError: 
jebl/evolution/treemetrics/RootedTreeMetric 

it is probably a problem with the classpath and/or the directory you are in when you issue the 
command. 



1G 



7 R code 



7.1 R code for density of origin 

origindensity <- function(x, n, a, b, w) { 

E <- exp(-a*x) 

B <- (1 - E) / (l-b*E) 

z <- 1 

z <- z * a 

z <- z * (1-b) 

z <- z * E 

z <- z / (l-b*E)~2 

z <- z * (w + (l-w)*B)~(n-2) 

z <- z * (w + n*(l-w)*B) 

z 

> 



log. origindensity <- function(x, n, a, b, w) { 

E <- exp(-a*x) 

B <- (1 - E) / (l-b*E) 

z <- 0 

z <- z + log(a) 

z <- z + log(l-b) 

z <- z - a*x 

z <- z - 2 * log(l-b*E) 

z <- z + (n-2) * log(w + (l-w)*B) 

z <- z + log(w + n*(l-w)*B) 

z 

> 

n <- 50 
a <- 1 
b <- 0.9 
w <- 0.5 

catC'mean nof species in prior is " , 1 + (n-l)*(l-w), "\n") 
x <- seq(from=0, to=10, length. out=1001) 

matplot(x, cbind(log(origindensity(x, n, a, b, w)), log. origindensity (x, n, a, b, w) ) , type="l") 
integrate (origindensity, lower=0, upper=Inf , n=n, a=a, b=b, w=w) 



7.2 R code for prior probabilities of number of species 

If w has a beta distribution for its hyperprior, with parameters a and b (shape and shapeB in the 
XML) and there are n individuals, then the prior probabilities of number of species is given by 

Prffc - x\n a b) I» T(x - 1 + b)T(n x + a) T(a + b) 

{ 1 ' ' ;_ T(x)T(n+l-x) T(n-l + a + b) T(a)T(b) 

and the function px below calculates this, for 1 < x < n. 
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library (VGAM) 

px <- function(x, n, a, b) { 

dbetabinom. ab(x-l , size=n-l, shapel=b, shape2=a) 
} 

# Alternative, without VGAM: 
px2 <- function(x, n, a, b) { 
p <- 1 

p <- p * gamma(n) 
p <- p / gamma(x) 
p <- p / gamma (n+l-x) 
p <- p * gamma(x-l+b) 
p <- p * gamma (n-x+a) 
p <- p / gamma (n-l+a+b) 
p <- p * gamma(a+b) 
p <- p / gamma(a) 
p <- p / gamma(b) 

P 

> 



z <- px(l:25, 25, 8,2) 
z2 <- px2(l:25, 25, 8,2) 
plot(z) 
z 

z2 
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7.3 R code for displaying the similarity matrix 



# Read in the table of clusterings 

workdir <- "C : /Users/Graham/ AAA/Programming/tmp/birthdeathcollapse/w70e0001/" 
x <- read. table (paste (workdir, "SDA_OUT.txt", sep=""), header=TRUE) 

# Abbreviations for display 
renames <- matrix(c( 
"Orthogeomys_heterodusl" , "0 .net" , 



x xx win \j ill y o 


bottael " 

_ VJ \J U O CLC X , 


ii 


.bot awa-a" , 


" Tin nmnmv^ 

x xx will w 111 y o 


bottaelO" 

u u CX \3 X w , 


"T.bot mew", 


" Th ninnniv^ 

x 11 wxii wxu y o 


hottapl 1 " 

_ VJ KJ U \j CLC XX , 


"T.bot sax", 


" Th nmnmvs 

x xx will w ill y o 


bottael2" 


"T.bot lat", 


" Tin nmnniv^ 

x xx will \J ill y o 


bottae2" 


1 


. bot awa - b" , 


" Tliomoniy s 


bottae3" , 


II T 1 
1 


. dot. xer , 


" Th nmnnivc; 
x 11 wxii wxii y o 


bottap4" 


11 J 


.bot cac", 


" Tin nmnmv^ 

x xx will \J ill y o 


bottae5" 

VJ \J U O CX \3 W , 


II J 


.bot alb", 


" Tin nmrimv^ 
x xxwxiiwiii y o 


hnttapfi " 


ii -j- 


.bot rui", 


" Th nmnmvs 

x xx will w ill y o 


bottap7" 

_ VJ \J U O CL C 1 , 


ii -j- 


.bot bot", 


"Thomomys 


_bottae8" , 


ii j 


.bot alp", 


"Thomomys 


_bottae9" , 


ii y 


.bot rip", 


"Thomomys 


.idahoensisl 


ii 

9 


"T.ida pyg-a" , 


"Thomomys 


_idahoensis2 


II 

> 


"T.ida pyg-b", 


"Thomomys 


.mazamal" , 


1! -J- 


. maz maz " , 


"Thomomys 


_mazama2" , 


II J 


.maz nas" , 


"Thomomys 


.monticolal" 


9 


"T.mon-a" , 


"Thomomys 


_monticola2" 


9 


"T.mon-b" , 


"Thomomys 


.talpoidesl" 


i 


■'T.tal oci", 


"Thomomys 


_talpoides2" 


9 


"T.tal yak", 


"Thomomys 


_talpoides3" 


3 


"T.tal bri", 


"Thomomys 


.townsendiil 


II 

9 


"T.tow tow", 


"Thomomys 


_townsendii2 


II 

9 


"T.tow rel", 


"Thomomys 


.umbrinusl" , 




'T.iimb chi", 


"Thomomys 


_umbrinus2" , 




'T.umb atr") , 



nrow=26, ncol=2, byrow=TRUE) 

# Minimal cluster names are column names omitting first 4 
mincl. names <- colnames(x) [-(1 :4)] 

# Check for typos, etc 

for (i in 1 : length (mincl .names) ) { 

stopif not (mincl .names [i] == renames [i , 1] ) 
} 
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# Make the similarity matrix 
displaynames <- renames [, 2] 
nmincls <- length (displaynames) 

sim <- matrix (0, ncol=nmincls , nrow=nmincls , dimnames=list (displaynames , displaynames)) 
for (i in l:nmincls) { 
for (j in l:nmincls) { 

coli <- x [ ,mincl .names [i] ] 

colj <- x [ ,mincl .names [j] ] 

w <- coli == colj 

sim[i,j] <- sum(x [w, "fraction"] ) 

> 

} 

# ensure rounding errors don't make probabilities sum to more than 1. 
sim <- pmin(sim,l) 

# change the order of minimal clusters 

# This depends on which patterns you want to emphasise 

neworder <- c(ll ,7 ,8, 9 , 10, 13, 3, 2 ,6,4,5, 12 , 24,23, 26,25, 16,17, 20,21,22, 1, 14,15, 18,19) 

# Currently recognised groups 
dividers <- c(0, 12, 14, 16, 18,21 ,22,24,26) 

plot. rectangle <- f unction (vl ,v2 ) 
{ 

polygon(c(vl[l] ,v2[l] ,v2[l] ,vl[l]), c(vl[2] ,vl[2] ,v2[2] ,v2[2]), . . .) 
} 

# Main plotting routine 

plot . simmatrix <- functionO { 
par(mar= c (0,5, 5,0) + . 1) 

plot(NULL, xlim=c(0, nmincls) , ylim=c (nmincls, 0) , axes=FALSE, ylab="", xlab="") 
axis(3, at=(l :nmincls)- . 5 , displaynames [neworder] , tick=FALSE, las=2, line=-l) 
axis(2, at=(l :nmincls)- . 5 , displaynames [neworder] , tick=FALSE, las=2, line=-l) 
for (i in l:nmincls) { 
for (j in l:nmincls) { 

d <- 1 - sim [neworder [i] ,neworder [j] ] 

plot. rectangle (c(i-l, j-1) , c(i,j), col=rgb(d,d,d) , border="white") 

} 

} 

for (b in dividers) { 

lines(x=c(-. 5, nmincls) , y=c(b,b)) 
lines(x=c(b,b) , y=c(-. 5, nmincls)) 

> 

} 

# Display as text, on screen, and to PDF file 
print (sim [neworder .neworder] , digits=2) 

plot . simmatrixO 

pdf (f ile=paste(workdir, "simmatrix.pdf", sep="")) 
plot . simmatrixO 
dev. off () 
} 
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8 Distribution on partitions from prior 



Probability distributions on the partitions of a finite set are of general interest. Some of these 
distributions arise as the result of a stochastic process. The Chinese restuarant process (see 
Wikipedia) is one example, and there are also a variety of preferential attachment processes (see 
Wikipedia). One of the latter is known as 'the Yule process' [7] which models the emergence of new 
genera among species. (Note that this is an extension to the pure birth model for speciation, which 
is also called a Yule process or Yule model.) None of these seems to be the same as the distribution 
associated with the model used here. The following result gives the probability of a given clustering 
in the model used here, conditioned on the number of clusters. The point process described in [3] 
shows why the merging process in the proposition produces the same distribution as sampling a 
tree using the method described in the main paper (sampling from / and (1 — w)f + wm), and 
then slicing through the tree. The result is almost surely already published somewhere, but I can't 
find it. It may be useful for calculating Bayes factors for different species delimitation hypotheses. 

Proposition. Assume a merging process which starts with n objects each assigned to its own 
cluster. Clusters are then merged by randomly choosing pairs among those available at each step 
until there are r clusters. Let b\ , b 2 , • • ■ , W be the sizes of the clusters in a particular clustering B 
produced by this process. Then 

Pr{B\r) 



= r! 



r!(n-r)!(r- l)!6i! 6 2 !...6 r ! 
(n -l)ln\ 



n 
r - 1 



61,62 



Proof. First we count the total number T of ordered mergings that make r clusters from 
{1, 2, . . . , n}. Then we count the number S of ordered mergings that make B and obtain 
Pr(_B|r) = S/T. It may help to think of a particular example of B such as 

B = { {1,2, .. .,&!>, {61 + I, &i+2,...,6 1 +6 2 },...,{n-6 J . + l, n-b r +2, . . . ,n} }. 

We have 

n\ f n — 1\ (n — r 

V\ 2 )"\ 2 
n\ {n-l)\ 
r! (r- 1)! 

For S, we multiply the numbers of ways in which each the clusters of sizes 61, 6 2 , . . . , b r can be 
formed by the number of ways that the mergings of different clusters can be interleaved. There are 
b — 1 mergings for a cluster of size 6, so there are 

n — r 

h - 1,62 - 1, • • • A - 1 

interleavings, and 

n-r \ bi\(bi - 1)! 



n 

7 i=l 



6i-l,62-l,.-.A-VAi 2bi_1 

r 

= (n-r)!2"- r Jj6i! 

i=i 

and the result follows. □ 
Suppose that there are Sj clusters of size i (1 < i < n). Note that X)"=i s i = r - Then there are 

/ „ \ -1 



n 

61,62, ... ,6, 



n-<« 

\i=l / 
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partitions of {1,2, ... ,n} which have the same shape (the same partition of the integer n). So the 
probability that the r clusters have a given shape is 




Example. Suppose n — 8, r = 3. Then 



Pr(4 + 2 + 2) = 3! (0! 2! 0! 1! 0! 0! 0! 0!)" 1 = 1/7 



and there are 



(0!2!0!1!0!0!0!0!)~ 1 = 210 



.4,2,2 

partitions of {1,2,..., 8} which have this shape. 

The Chinese restuarant process is different from the merging process above. The probabilities for 
n = 6 arc shown in table 4. When conditioned on r = 2, the Chinese restuarant process gives 
probabilities in ratio 72:45:20 for 5+1,4+2,3+3, whereas the merging process gives 2:2:1. When 
conditioned on r — 3, the Chinese restuarant process gives 6:8:1 for 4+1+1,3+2+1,2+2+2, 
whereas the merging process gives 3:6:1. 



Table 4: Chinese restuarant process 



Shape 


P (partition) 


#ptns per shape 


P(shape) 


6 


120/720 


1 


120/720 


5+1 


24/720 


6 


144/720 


4+2 


6/720 


15 


90/720 


3+3 


4/720 


10 


40/720 


4+1+1 


6/720 


15 


60/720 


3+2+1 


2/720 


60 


120/720 


2+2+2 


1/720 


15 


15/720 


3+1+1+1 


2/720 


20 


40/720 


2+2+1+1 


1/720 


45 


45/720 


2+1+1+1+1 


1/720 


15 


15/720 


1+1+1+1+1+1 


1/720 


1 


1/720 
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